In parts one and two we considered a dataset consisting of bill and flipper length for penguins:
import pandas as pd
import seaborn as sns

df = (
    pd.read_csv(
        "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv",
    )
    .dropna()  # missing data will confuse things
)[['species', 'bill_length_mm', 'flipper_length_mm']]

sns.relplot(
    data=df,
    x='bill_length_mm',
    y='flipper_length_mm',
    hue='species',
    height=8,
    hue_order=['Adelie', 'Gentoo', 'Chinstrap']
)
and implemented a function that could guess the species of a new penguin based on its measurements by finding the penguins with most similar measurements:
import numpy as np

def guess_species(bill_length, flipper_length):
    # calculate distances and add to the dataframe
    flipper_length_difference = df['flipper_length_mm'] - flipper_length
    bill_length_difference = df['bill_length_mm'] - bill_length
    overall_distance = np.sqrt(
        flipper_length_difference ** 2
        + bill_length_difference ** 2
    )
    df['distance to new point'] = overall_distance

    # find closest points and calculate most common species
    most_common_species = (
        df.sort_values('distance to new point')
        .head(10)
        ['species']
        .mode()[0]
    )

    # the most common species is our guess
    return most_common_species

guess_species(45, 211)
'Gentoo'
If you haven’t read parts one and two of this series, or if it’s been a while, then I recommend at least skimming them so that you will be able to jump in here. And as a reminder, all of the pandas/seaborn magic that we are using for data processing and visualization is explained in the Biological Data Exploration book.
Naming things
Before we start with anything new, we should introduce a few specific terms to bring our descriptions into line with machine learning terminology:
the function that we wrote is a classifier, since its job is to start with measurements and generate a label
the output of the function is its prediction (this sounds a bit more scientific than guess)
the collection of penguin measurements that the function uses is the training set
the process of inputting the training set (which in this case we do by simply having the function use our existing dataframe) is called fitting
the training set consists of two parts: the measurements which we call features, and the species names which we call classes. (I know this is confusing since in biology class is a taxonomic level, but we will ignore that for now)
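To make the mapping concrete, here's a rough sketch of how those terms line up with the objects in our code (the variable names features and classes below are just for illustration; they don't appear in the original function):

# the training set is the dataframe of penguins the function already knows about
# the features are the measurements it works from
features = df[['bill_length_mm', 'flipper_length_mm']]

# the classes are the labels we want it to predict
classes = df['species']

# guess_species() itself is the classifier, and the value it returns is a prediction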
It might be interesting at this point to note that although we’ve implemented this function to work specifically for our dataset, the general approach - to predict the class of a new data point, count the classes of the closest points and pick the class that appears most often - seems like it would work for other pairs of measurements. And if we can figure out a way to calculate distances between points in more than 2 dimensions, it might even work for data points with more than two measurements. We will come back to this idea later in this series; for now, back to the function that is in front of us.
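Before we do, here's a tiny preview of why that generalization is plausible - a minimal sketch rather than anything we'll use yet. The Euclidean distance calculation inside our function extends naturally to any number of measurements:

# Euclidean distance between two points with any number of measurements;
# with exactly two measurements this is the same calculation as in guess_species()
def distance(point_a, point_b):
    differences = np.array(point_a) - np.array(point_b)
    return np.sqrt((differences ** 2).sum())

# e.g. three measurements per penguin (the numbers here are made up)
distance([45, 211, 4500], [39, 190, 3700])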
The most interesting bit of terminology is the name of the algorithm that we have implemented in the function - K nearest neighbors. I will use the American spelling for neighbors as that’s how the function name is spelled in various machine learning libraries.
Choosing parameters
The K bit of the name is because there is a parameter that we can tweak in this algorithm: the number of closest points that we want to consider when making our prediction. In the version of the function above we have used 10, but that’s an arbitrary choice, and we can easily rewrite the function so that k is an argument:
def guess_species(bill_length, flipper_length, k):
    # calculate distances and add to the dataframe
    flipper_length_difference = df['flipper_length_mm'] - flipper_length
    bill_length_difference = df['bill_length_mm'] - bill_length
    overall_distance = np.sqrt(
        flipper_length_difference ** 2
        + bill_length_difference ** 2
    )
    df['distance to new point'] = overall_distance

    # find closest points and calculate most common species
    most_common_species = (
        df.sort_values('distance to new point')
        .head(k)
        ['species']
        .mode()[0]
    )

    # the most common species is our guess
    return most_common_species
We can now fairly easily find sets of measurements for which we get different predictions depending on which value of K we pick:
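For example (the exact measurements here are my own illustrative choice, and you may need to try a few points near the boundary between two clusters to see the effect in your copy of the data), a point that sits between the Gentoo and Chinstrap regions can come out differently for different values of K:

# the same point, classified using different numbers of neighbors;
# for a borderline point like this the two answers may well disagree
print(guess_species(50, 205, 1))
print(guess_species(50, 205, 25))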
This raises an obvious question: which value of K should we pick to get the most accurate results from our classifier function? To answer it we need to look in a bit more detail at how we score a classifier. We have already seen one way to do this in part two: we wrote some code to run the prediction function on all of the real-life penguin measurements, then compared the predictions to the true species and asked what fraction of the time they matched:
# generate predicted species for k=10
predictions = df.apply(
    lambda p: guess_species(p['bill_length_mm'], p['flipper_length_mm'], 10),
    axis=1
)

# compare predictions to true species and get the fraction that match
(predictions == df['species']).value_counts(normalize=True)[True]
So it seems like a trivial matter to try this for a bunch of different values of k and just pick the highest score:
for k in range(1, 10):
    # generate predicted species
    predictions = df.apply(
        lambda p: guess_species(
            p['bill_length_mm'],
            p['flipper_length_mm'],
            k  # this will be different each time
        ),
        axis=1
    )

    # compare predictions to true species and get the fraction that match
    score = (predictions == df['species']).value_counts(normalize=True)[True]
    print(k, score)
However, under this scoring scheme, a k value of 1 will always give a perfect score. It’s easy to see why when we think about the test data: because each point we are testing is actually in the dataset, the closest neighbor will always be the point itself. So by setting k to 1 the only point that will be used in the prediction will be the same one we are testing. In other words, we are testing the function by measuring how well it can predict the classes of data points that it has already seen.
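We can check this directly with a quick sanity test - a minimal sketch that just re-uses the function on a row taken straight from the dataframe:

# take a penguin that is already in the dataset
p = df.iloc[0]

# with k=1 the nearest neighbor is that same row (distance zero),
# so the "prediction" simply reads back the penguin's own species
print(guess_species(p['bill_length_mm'], p['flipper_length_mm'], 1) == p['species'])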
In machine learning, this is a big mistake. After all, the whole goal of building a classifier is to get reliable predictions for new points (in our case, new penguins). So to fairly measure the performance, we should test the function by using it to generate predictions for measurements where we know the correct species, but which are not in the training set.
Training and testing
The easiest way to do this is to randomly assign each penguin (i.e. each row in our dataframe) to be used either for training or for testing, but not both. We can do that by adding a new column to the dataframe, which will be ‘training’ 80% of the time and ‘testing’ 20% of the time. numpy can do most of the work for us with its random.choice function:
df['assignment'] = np.random.choice(
    ['training', 'testing'],  # these are the items to randomly choose
    333,                      # this is the number of choices we want
    p=[0.8, 0.2]              # these are the probabilities of picking each item
)
df
       species  bill_length_mm  flipper_length_mm  distance to new point assignment
0       Adelie            39.1              181.0              20.302955   training
1       Adelie            39.5              186.0              16.077624   training
2       Adelie            40.3              195.0              10.344564   training
4       Adelie            36.7              193.0              14.396180   training
5       Adelie            39.3              190.0              13.520725   training
...        ...             ...                ...                    ...        ...
339  Chinstrap            55.8              207.0              10.600000   training
340  Chinstrap            43.5              202.0               7.803204   training
341  Chinstrap            49.6              193.0               5.035871   training
342  Chinstrap            50.8              210.0              12.014991   training
343  Chinstrap            50.2              198.0               0.000000   training

333 rows × 5 columns
and if we do value_counts on the resulting column we can see the actual numbers:
df['assignment'].value_counts()
training 251
testing 82
Name: assignment, dtype: int64
Since the assignments are generated randomly we’ll get slightly different absolute numbers each time we run the code, but the split will always be roughly 80%/20%. These percentages are arbitrary, of course; we could instead choose to use 10% or 30% of the data for testing.
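For example, to hold out roughly 30% of the penguins for testing instead, we would just change the probabilities we pass to np.random.choice (a quick sketch; for the rest of this article we'll stick with the 80/20 split):

df['assignment'] = np.random.choice(
    ['training', 'testing'],
    333,
    p=[0.7, 0.3]  # now roughly 30% of the rows end up in the testing group
)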
Now we can use the new column to split the penguins up into a training dataset and a testing dataset, which we can store in variables for convenience:
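# keep the rows assigned to each group in separate dataframes
training_data = df[df['assignment'] == 'training']
test_data = df[df['assignment'] == 'testing']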
These new dataframes are simply subsets of the original rows. Now we are prepared to do a fairer test of the function. We’ll replace the reference to df with a reference to training_data, which will make sure that the function can only “see” the species of the points in the training set:
def guess_species(bill_length, flipper_length, k):
    # calculate distances and add to the dataframe
    flipper_length_difference = training_data['flipper_length_mm'] - flipper_length
    bill_length_difference = training_data['bill_length_mm'] - bill_length
    overall_distance = np.sqrt(
        flipper_length_difference ** 2
        + bill_length_difference ** 2
    )
    training_data['distance to new point'] = overall_distance

    # find closest points and calculate most common species
    most_common_species = (
        training_data.sort_values('distance to new point')
        .head(k)
        ['species']
        .mode()[0]
    )

    # the most common species is our guess
    return most_common_species
And when we run our testing code, we will only use the test data:
# generate predicted species for k=10
predictions = test_data.apply(
    lambda p: guess_species(p['bill_length_mm'], p['flipper_length_mm'], 10),
    axis=1
)

# compare predictions to true species and get the fraction that match
(predictions == test_data['species']).value_counts(normalize=True)[True]
In this version of the test, the training data and testing data are completely separate - we use one set of data points to train the classifier (i.e. to find the nearest neighbors) and an entirely different set of data points to test it. In other words, we are now asking how well the function can predict the species of a penguin it has never seen before, which is a much closer scenario to real world use.
Let’s now put the new testing code in a loop, and calculate a score for each value of k between 1 and 10:
for k in range(1, 10):
    # generate predicted species
    predictions = test_data.apply(
        lambda p: guess_species(p['bill_length_mm'], p['flipper_length_mm'], k),
        axis=1
    )

    # compare predictions to true species and get the fraction that match
    score = (predictions == test_data['species']).value_counts(normalize=True)[True]
    print(k, score)
Finally, it seems like we have a clear answer: for this particular problem, we get the most accurate predictions with k=5 (i.e. by considering the five closest points when trying to classify a new point), with worse predictions if we make k either higher or lower.
Unfortunately, it’s a bit more complicated than that. The result we have obtained is only true for the specific split of data into training and testing subsets. This will be obvious if you run the code yourself - chances are, you’ll get a different result from mine above.
Remember: there’s nothing special about the particular split we have obtained here - we got it by randomly assigning each row to either the training or test data. To illustrate this, look what happens when we repeat the splitting process ten times and then calculate the score for each different random split. Notice that in this loop we are keeping k fixed at 10, and just doing the same experiment repeatedly:
for _ in range(10):
    # randomly assign the rows to either training or test
    df['assignment'] = np.random.choice(['training', 'testing'], 333, p=[0.8, 0.2])
    training_data = df[df['assignment'] == 'training']
    test_data = df[df['assignment'] == 'testing']

    # calculate score
    predictions = test_data.apply(
        lambda p: guess_species(
            p['bill_length_mm'],
            p['flipper_length_mm'],
            10  # k is always 10
        ),
        axis=1
    )
    score = (predictions == test_data['species']).value_counts(normalize=True)[True]
    print(score)
The scores vary a lot, from a near-perfect 97% accuracy to a much less impressive 91% accuracy. Thinking about it logically, we can see why this must be the case. In some iterations, due to random chance, we will get a training set and test set that are very similar, in which case the function will perform well. But in other iterations, the training and test sets will be very different, leading to a lower score. If we are very unlucky, we might get a split where most of the Gentoo penguins end up in the training set, but most of the Adelie penguins end up in the test set. Because our function simply counts species among the nearest training points, a training set with few Adelie penguins means it will very rarely predict Adelie; and since the test set is full of Adelie penguins, the score will suffer.
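One quick way to see how balanced (or unbalanced) a particular random split is - a small sketch, not part of the original analysis - is to compare the species proportions in the two subsets:

# compare the species make-up of the training and testing subsets;
# for an unlucky split these proportions can look quite different
print(training_data['species'].value_counts(normalize=True))
print(test_data['species'].value_counts(normalize=True))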
Another way to visualise this randomness is to draw multiple prediction charts for different random selections of training data. We can use the code from the previous part, and put it in a loop:
# ignore some pandas warnings that we don't care about
pd.options.mode.chained_assignment = None

# make 10 different prediction grids with the same value of k
bill_lengths = np.arange(35, 60, 1)
flipper_lengths = np.arange(170, 230, 1)

data = []
for iteration in range(10):
    # randomly assign the rows to either training or test
    df['assignment'] = np.random.choice(['training', 'testing'], 333, p=[0.8, 0.2])
    training_data = df[df['assignment'] == 'training']
    test_data = df[df['assignment'] == 'testing']

    for bill_length in bill_lengths:
        for flipper_length in flipper_lengths:
            guess = guess_species(bill_length, flipper_length, 3)
            data.append((bill_length, flipper_length, guess, iteration))

data = pd.DataFrame(data, columns=['bill_length', 'flipper_length', 'prediction', 'iteration'])

# plot the predictions on multiple charts
sns.relplot(
    data=data,
    x='bill_length',
    y='flipper_length',
    hue='prediction',
    hue_order=['Adelie', 'Gentoo', 'Chinstrap'],
    col='iteration',
    col_wrap=5,
    height=2,
    edgecolor='none'
)
Here I’ve used k=3 to make the effect easier to see in small charts. Notice how the outlines of the prediction regions are slightly different in each iteration, due to the random sampling of data points in the training set, even though we are keeping k exactly the same. Chapter 8 of Biological Data Exploration has details of how to create this kind of small multiple plot.
We will finish this part of the series here. In the next part, we’ll pick up the idea of scoring a classifier and look at a few ways to get round the problem of randomness. If you don’t want to miss it, sign up for the Python for Biologists newsletter just below this article.